Welcome to my MIS4470 (Practical Computing for Data Analysis) final project. My objective is to gain familiarity with the nflfastR API, improve my data cleaning/preparation skills, and improve my data modeling skills. I will be loading in data using the nflfastR package, creating exploratory plots using ggplot relating my response variable (fantasy points per game) to various independent variables that will be considered, and creating linear regression models for the Running Back (RB) and Wide Receiver (WR) positions. Also, I will create easy-to-read summary tables using the gt package.
Some terms that you will need to be familiar with as you go through my project:
Fantasy Football
Fantasy Points
Target Share
Air Yards Share
#loading NFL data
library(nflfastR)
library(nflreadr)
##
## Attaching package: 'nflreadr'
## The following objects are masked from 'package:nflfastR':
##
## load_pbp, load_player_stats
library(nflplotR)
## Warning: package 'nflplotR' was built under R version 4.1.3
#cleaning data
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
#plotting team logos
library(ggimage)
#calculate ages
library(eeptools)
## Warning: package 'eeptools' was built under R version 4.1.3
#Creating summary tables
library(gt)
#MAE
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
The analysis I will conduct is going to focus on the 2020 and 2021 NFL seasons. I will load these into two separate dataframes.The field descriptions for the dataset can be found at this link.
pbp_20 <- load_pbp(2020)
pbp_21 <- load_pbp(2021)
The response variable of interest will be fantasy points per game. Fantasy football goes on during the regular season of the NFL, so I will filter the data to include the regular season only. Then, instead of play-by-play data that is supplied by the load_pbp() function that I used above, I am interested in data that summarizes the entire season for each player. This data can be obtained using the calculate_player_stats() function from nflfastR.
pbp_20 <- pbp_20 %>%
filter(season_type == "REG")
stats_2020 <- calculate_player_stats(pbp_20, weekly = FALSE)
stats_2020
## # A tibble: 634 x 46
## player_id player_name games recent_team completions attempts passing_yards
## <chr> <chr> <int> <chr> <int> <int> <dbl>
## 1 00-0019596 T.Brady 16 TB 401 610 4633
## 2 00-0020531 D.Brees 12 NO 275 390 2942
## 3 00-0022127 J.Witten 10 LV 0 0 0
## 4 00-0022787 M.Schaub 1 ATL 0 0 0
## 5 00-0022824 A.Lee 1 ARI 1 1 26
## 6 00-0022921 L.Fitzgerald 13 ARI 0 0 0
## 7 00-0022924 B.Roethlisbe~ 15 PIT 399 608 3803
## 8 00-0022942 P.Rivers 16 IND 369 543 4169
## 9 00-0023436 A.Smith 8 WAS 168 252 1582
## 10 00-0023459 A.Rodgers 16 GB 372 526 4299
## # ... with 624 more rows, and 39 more variables: passing_tds <int>,
## # interceptions <dbl>, sacks <dbl>, sack_yards <dbl>, sack_fumbles <int>,
## # sack_fumbles_lost <int>, passing_air_yards <dbl>,
## # passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## # passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## # carries <int>, rushing_yards <dbl>, rushing_tds <int>,
## # rushing_fumbles <dbl>, rushing_fumbles_lost <dbl>, ...
pbp_21 <- pbp_21 %>%
filter(season_type == "REG")
stats_2021 <- calculate_player_stats(pbp_21, weekly = FALSE)
stats_2021
## # A tibble: 655 x 46
## player_id player_name games recent_team completions attempts passing_yards
## <chr> <chr> <int> <chr> <int> <int> <dbl>
## 1 00-0019596 T.Brady 17 TB 485 719 5316
## 2 00-0022824 A.Lee 1 ARI 0 0 0
## 3 00-0022924 B.Roethlisbe~ 16 PIT 390 605 3740
## 4 00-0023459 Aa.Rodgers 16 GB 366 531 4115
## 5 00-0023682 R.Fitzpatrick 1 WAS 3 6 13
## 6 00-0024243 M.Lewis 13 GB 0 0 0
## 7 00-0024417 S.Koch 1 BAL 0 1 0
## 8 00-0025394 A.Peterson 4 SEA 0 0 0
## 9 00-0026035 D.Amendola 8 HOU 0 1 0
## 10 00-0026143 M.Ryan 17 ATL 375 560 3968
## # ... with 645 more rows, and 39 more variables: passing_tds <int>,
## # interceptions <dbl>, sacks <dbl>, sack_yards <dbl>, sack_fumbles <int>,
## # sack_fumbles_lost <int>, passing_air_yards <dbl>,
## # passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## # passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## # carries <int>, rushing_yards <dbl>, rushing_tds <int>,
## # rushing_fumbles <dbl>, rushing_fumbles_lost <dbl>, ...
During the EDA portion, I am planning on making plots that include team logos, colors, etc. There is a table within nflfastR that contains this information. I will load this information into my data using the left_join() function. The decode_player_ids() function in nflfastr is used to decode the player IDs in the play-by-play data set to match the IDs in the logo data set. I will eventually be joining 2021 fantasy points per game into the 2020 dataset, so I will rename the fantasy_points_ppr field to indicate that it is from the 2021 dataset.
# str(teams_colors_logos)
stats_2020 <- stats_2020 %>%
left_join(teams_colors_logos, by = c("recent_team" = "team_abbr"))
stats_2021 <- stats_2021 %>%
left_join(teams_colors_logos, by = c("recent_team" = "team_abbr")) %>%
rename(fantasy_points_ppr_21 = fantasy_points_ppr)
decode_player_ids(pbp_20, fast = TRUE)
## v {.field 18:06:38 | Decoding of player ids completed}
## # A tibble: 46,189 x 372
## play_id game_id old_game_id home_team away_team season_type week posteam
## <dbl> <chr> <chr> <chr> <chr> <chr> <int> <chr>
## 1 1 2020_01_AR~ 2020091311 SF ARI REG 1 <NA>
## 2 39 2020_01_AR~ 2020091311 SF ARI REG 1 SF
## 3 54 2020_01_AR~ 2020091311 SF ARI REG 1 SF
## 4 93 2020_01_AR~ 2020091311 SF ARI REG 1 SF
## 5 118 2020_01_AR~ 2020091311 SF ARI REG 1 SF
## 6 143 2020_01_AR~ 2020091311 SF ARI REG 1 SF
## 7 165 2020_01_AR~ 2020091311 SF ARI REG 1 SF
## 8 197 2020_01_AR~ 2020091311 SF ARI REG 1 SF
## 9 226 2020_01_AR~ 2020091311 SF ARI REG 1 ARI
## 10 245 2020_01_AR~ 2020091311 SF ARI REG 1 ARI
## # ... with 46,179 more rows, and 364 more variables: posteam_type <chr>,
## # defteam <chr>, side_of_field <chr>, yardline_100 <dbl>, game_date <chr>,
## # quarter_seconds_remaining <dbl>, half_seconds_remaining <dbl>,
## # game_seconds_remaining <dbl>, game_half <chr>, quarter_end <dbl>,
## # drive <dbl>, sp <dbl>, qtr <dbl>, down <dbl>, goal_to_go <dbl>, time <chr>,
## # yrdln <chr>, ydstogo <dbl>, ydsnet <dbl>, desc <chr>, play_type <chr>,
## # yards_gained <dbl>, shotgun <dbl>, no_huddle <dbl>, qb_dropback <dbl>, ...
In the interest of time, I am going to create models for the RB and WR positions only (it was getting way too long with QB and TE included). So, I need to create data frames for the WR and RB positions. The nflreadR package has a load_rosters() functions that will give us specific player name/position/team information. I will create a player_name1 field that matches the format of player_name in the stats_* data frames. I’ll include the head of one of these data frames so we can get an idea of what they contain.
WR_names20 <- load_rosters(2020) %>%
filter(position == "WR") %>%
select(position, full_name, first_name, last_name, team)
WR_names20$player_name1 <- paste(substr(WR_names20$first_name,1, 1), WR_names20$last_name, sep = ".")
WR_names21 <- load_rosters(2021) %>%
filter(position == "WR") %>%
select(position, full_name, first_name, last_name, team)
WR_names21$player_name1 <- paste(substr(WR_names21$first_name,1, 1), WR_names21$last_name, sep = ".")
RB_names20 <- load_rosters(2020) %>%
filter(position == "RB") %>%
select(position, full_name, first_name, last_name, team)
RB_names20$player_name1 <- paste(substr(RB_names20$first_name,1, 1), RB_names20$last_name, sep = ".")
RB_names21 <- load_rosters(2021) %>%
filter(position == "RB") %>%
select(position, full_name, first_name, last_name, team)
RB_names21$player_name1 <- paste(substr(RB_names21$first_name,1, 1), RB_names21$last_name, sep = ".")
head(RB_names20)
## # A tibble: 6 x 6
## position full_name first_name last_name team player_name1
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 RB Kenyan Drake Kenyan Drake ARI K.Drake
## 2 RB Chase Edmonds Chase Edmonds ARI C.Edmonds
## 3 RB Khalfani Muhammad Khalfani Muhammad ARI K.Muhammad
## 4 RB Eno Benjamin Eno Benjamin ARI E.Benjamin
## 5 RB D.J. Foster D.J. Foster ARI D.Foster
## 6 RB Jonathan Ward Jonathan Ward ARI J.Ward
Now that I have separated name information for each position, I am able to split the stats_2020 and stats_2021 data frames by position. One issue that I’ve experienced with the first initial . last name format is players that have the same first initial and last name. To address this, I will attach the players team to their name field before filitering the data.
RB_names20$nameteam <- paste(RB_names20$player_name1, RB_names20$team, sep = "")
RB_names21$nameteam <- paste(RB_names21$player_name1, RB_names21$team, sep = "")
WR_names20$nameteam <- paste(WR_names20$player_name1, WR_names20$team, sep = "")
WR_names21$nameteam <- paste(WR_names21$player_name1, WR_names21$team, sep = "")
stats_2020$nameteam <- paste(stats_2020$player_name, stats_2020$recent_team, sep = "")
stats_2021$nameteam <- paste(stats_2021$player_name, stats_2021$recent_team, sep = "")
WR_names21 <- WR_names21 %>%
mutate(nameteam = replace(nameteam, nameteam == "D.MooreCAR", "Dj.MooreCAR"))
Now that I have prepared the data to be separated by position, I will join the stats_* data frames to the names data frames. For RBs, I will filter to RBs that had more than 30 carries to filter out the players that were not being considered for fantasy lineups.
rbstats_2020 <- RB_names20 %>%
filter(nameteam %in% stats_2020$nameteam) %>%
left_join(stats_2020, by = "nameteam") %>%
filter(carries > 30) %>%
arrange(-fantasy_points_ppr)
head(rbstats_2020)
## # A tibble: 6 x 63
## position full_name first_name last_name team player_name1 nameteam player_id
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 RB Alvin Kam~ Alvin Kamara NO A.Kamara A.Kamar~ 00-00339~
## 2 RB Dalvin Co~ Dalvin Cook MIN D.Cook D.CookM~ 00-00338~
## 3 RB Derrick H~ Derrick Henry TEN D.Henry D.Henry~ 00-00327~
## 4 RB David Mon~ David Montgome~ CHI D.Montgomery D.Montg~ 00-00356~
## 5 RB Aaron Jon~ Aaron Jones GB A.Jones A.Jones~ 00-00332~
## 6 RB Jonathan ~ Jonathan Taylor IND J.Taylor J.Taylo~ 00-00362~
## # ... with 55 more variables: player_name <chr>, games <int>,
## # recent_team <chr>, completions <int>, attempts <int>, passing_yards <dbl>,
## # passing_tds <int>, interceptions <dbl>, sacks <dbl>, sack_yards <dbl>,
## # sack_fumbles <int>, sack_fumbles_lost <int>, passing_air_yards <dbl>,
## # passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## # passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## # carries <int>, rushing_yards <dbl>, rushing_tds <int>, ...
stats_2021
## # A tibble: 655 x 57
## player_id player_name games recent_team completions attempts passing_yards
## <chr> <chr> <int> <chr> <int> <int> <dbl>
## 1 00-0019596 T.Brady 17 TB 485 719 5316
## 2 00-0022824 A.Lee 1 ARI 0 0 0
## 3 00-0022924 B.Roethlisbe~ 16 PIT 390 605 3740
## 4 00-0023459 Aa.Rodgers 16 GB 366 531 4115
## 5 00-0023682 R.Fitzpatrick 1 WAS 3 6 13
## 6 00-0024243 M.Lewis 13 GB 0 0 0
## 7 00-0024417 S.Koch 1 BAL 0 1 0
## 8 00-0025394 A.Peterson 4 SEA 0 0 0
## 9 00-0026035 D.Amendola 8 HOU 0 1 0
## 10 00-0026143 M.Ryan 17 ATL 375 560 3968
## # ... with 645 more rows, and 50 more variables: passing_tds <int>,
## # interceptions <dbl>, sacks <dbl>, sack_yards <dbl>, sack_fumbles <int>,
## # sack_fumbles_lost <int>, passing_air_yards <dbl>,
## # passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## # passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## # carries <int>, rushing_yards <dbl>, rushing_tds <int>,
## # rushing_fumbles <dbl>, rushing_fumbles_lost <dbl>, ...
rbstats_2021 <- RB_names21 %>%
filter(nameteam %in% stats_2021$nameteam) %>%
left_join(stats_2021, by = "nameteam") %>%
filter(carries > 30) %>%
arrange(-fantasy_points_ppr_21)
head(rbstats_2021)
## # A tibble: 6 x 63
## position full_name first_name last_name team player_name1 nameteam player_id
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 RB Jonathan ~ Jonathan Taylor IND J.Taylor J.Taylo~ 00-00362~
## 2 RB Austin Ek~ Austin Ekeler LAC A.Ekeler A.Ekele~ 00-00336~
## 3 RB Najee Har~ Najee Harris PIT N.Harris N.Harri~ 00-00368~
## 4 RB Joe Mixon Joe Mixon CIN J.Mixon J.Mixon~ 00-00338~
## 5 RB James Con~ James Conner ARI J.Conner J.Conne~ 00-00335~
## 6 RB Leonard F~ Leonard Fournette TB L.Fournette L.Fourn~ 00-00338~
## # ... with 55 more variables: player_name <chr>, games <int>,
## # recent_team <chr>, completions <int>, attempts <int>, passing_yards <dbl>,
## # passing_tds <int>, interceptions <dbl>, sacks <dbl>, sack_yards <dbl>,
## # sack_fumbles <int>, sack_fumbles_lost <int>, passing_air_yards <dbl>,
## # passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## # passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## # carries <int>, rushing_yards <dbl>, rushing_tds <int>, ...
I’ll filter WRs with at least 20 receptions to filter out WRs that were never considered to be added to fantasy lineups.
wrstats_2020 <- WR_names20 %>%
filter(nameteam %in% stats_2020$nameteam) %>%
left_join(stats_2020, by = "nameteam") %>%
filter(receptions > 20) %>%
arrange(-fantasy_points_ppr)
head(wrstats_2020)
## # A tibble: 6 x 63
## position full_name first_name last_name team player_name1 nameteam player_id
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 WR Davante A~ Davante Adams GB D.Adams D.Adams~ 00-00313~
## 2 WR Tyreek Hi~ Tyreek Hill KC T.Hill T.HillKC 00-00330~
## 3 WR Stefon Di~ Stefon Diggs BUF S.Diggs S.Diggs~ 00-00315~
## 4 WR DeAndre H~ DeAndre Hopkins ARI D.Hopkins D.Hopki~ 00-00305~
## 5 WR Calvin Ri~ Calvin Ridley ATL C.Ridley C.Ridle~ 00-00348~
## 6 WR Justin Je~ Justin Jefferson MIN J.Jefferson J.Jeffe~ 00-00363~
## # ... with 55 more variables: player_name <chr>, games <int>,
## # recent_team <chr>, completions <int>, attempts <int>, passing_yards <dbl>,
## # passing_tds <int>, interceptions <dbl>, sacks <dbl>, sack_yards <dbl>,
## # sack_fumbles <int>, sack_fumbles_lost <int>, passing_air_yards <dbl>,
## # passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## # passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## # carries <int>, rushing_yards <dbl>, rushing_tds <int>, ...
wrstats_2020
## # A tibble: 105 x 63
## position full_name first_name last_name team player_name1 nameteam player_id
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 WR Davante ~ Davante Adams GB D.Adams D.Adams~ 00-00313~
## 2 WR Tyreek H~ Tyreek Hill KC T.Hill T.HillKC 00-00330~
## 3 WR Stefon D~ Stefon Diggs BUF S.Diggs S.Diggs~ 00-00315~
## 4 WR DeAndre ~ DeAndre Hopkins ARI D.Hopkins D.Hopki~ 00-00305~
## 5 WR Calvin R~ Calvin Ridley ATL C.Ridley C.Ridle~ 00-00348~
## 6 WR Justin J~ Justin Jefferson MIN J.Jefferson J.Jeffe~ 00-00363~
## 7 WR DK Metca~ DK Metcalf SEA D.Metcalf D.Metca~ 00-00356~
## 8 WR Tyler Lo~ Tyler Lockett SEA T.Lockett T.Locke~ 00-00322~
## 9 WR Allen Ro~ Allen Robinson CHI A.Robinson A.Robin~ 00-00314~
## 10 WR Adam Thi~ Adam Thielen MIN A.Thielen A.Thiel~ 00-00300~
## # ... with 95 more rows, and 55 more variables: player_name <chr>, games <int>,
## # recent_team <chr>, completions <int>, attempts <int>, passing_yards <dbl>,
## # passing_tds <int>, interceptions <dbl>, sacks <dbl>, sack_yards <dbl>,
## # sack_fumbles <int>, sack_fumbles_lost <int>, passing_air_yards <dbl>,
## # passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## # passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## # carries <int>, rushing_yards <dbl>, rushing_tds <int>, ...
wrstats_2021 <- WR_names21 %>%
filter(nameteam %in% stats_2021$nameteam) %>%
left_join(stats_2021, by = "nameteam") %>%
filter(receptions > 20) %>%
arrange(-fantasy_points_ppr_21)
head(wrstats_2021)
## # A tibble: 6 x 63
## position full_name first_name last_name team player_name1 nameteam player_id
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 WR Cooper Ku~ Cooper Kupp LA C.Kupp C.KuppLA 00-00339~
## 2 WR Davante A~ Davante Adams GB D.Adams D.Adams~ 00-00313~
## 3 WR Deebo Sam~ Deebo Samuel SF D.Samuel D.Samue~ 00-00357~
## 4 WR Justin Je~ Justin Jefferson MIN J.Jefferson J.Jeffe~ 00-00363~
## 5 WR Ja'Marr C~ Ja'Marr Chase CIN J.Chase J.Chase~ 00-00369~
## 6 WR Tyreek Hi~ Tyreek Hill KC T.Hill T.HillKC 00-00330~
## # ... with 55 more variables: player_name <chr>, games <int>,
## # recent_team <chr>, completions <int>, attempts <int>, passing_yards <dbl>,
## # passing_tds <int>, interceptions <dbl>, sacks <dbl>, sack_yards <dbl>,
## # sack_fumbles <int>, sack_fumbles_lost <int>, passing_air_yards <dbl>,
## # passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## # passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## # carries <int>, rushing_yards <dbl>, rushing_tds <int>, ...
Instead of using 2020 as a train set and 2021 as a test set, I’d like to join 2021 fantasy points per game to the 2020 data set. The reason for this is because fantasy points are calculated as a function of yards, receptions, touchdowns etc. So, if I were to create a linear model for 2020 fantasy points using those metrics from 2020, I would get a nearly perfect model. My goal is to attempt to model 2021 fantasy points per game using various 2020 metrics to see how well we can model the following year’s fantasy points.
rbstats_2021 <- rbstats_2021 %>%
mutate(fantasy_PPG_21 = fantasy_points_ppr_21 / games) %>%
select(nameteam, fantasy_PPG_21)
wrstats_2021 <- wrstats_2021 %>%
mutate(fantasy_PPG_21 = fantasy_points_ppr_21 / games) %>%
select(nameteam, fantasy_PPG_21)
rbstats_2020 <- rbstats_2020 %>%
left_join(rbstats_2021, by = "nameteam") %>%
mutate(fantasy_PPG_last_yr = fantasy_points_ppr / games)
wrstats_2020 <- wrstats_2020 %>%
left_join(wrstats_2021, by = "nameteam") %>%
mutate(fantasy_PPG_last_yr = fantasy_points_ppr / games)
head(wrstats_2020)
## # A tibble: 6 x 65
## position full_name first_name last_name team player_name1 nameteam player_id
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 WR Davante A~ Davante Adams GB D.Adams D.Adams~ 00-00313~
## 2 WR Tyreek Hi~ Tyreek Hill KC T.Hill T.HillKC 00-00330~
## 3 WR Stefon Di~ Stefon Diggs BUF S.Diggs S.Diggs~ 00-00315~
## 4 WR DeAndre H~ DeAndre Hopkins ARI D.Hopkins D.Hopki~ 00-00305~
## 5 WR Calvin Ri~ Calvin Ridley ATL C.Ridley C.Ridle~ 00-00348~
## 6 WR Justin Je~ Justin Jefferson MIN J.Jefferson J.Jeffe~ 00-00363~
## # ... with 57 more variables: player_name <chr>, games <int>,
## # recent_team <chr>, completions <int>, attempts <int>, passing_yards <dbl>,
## # passing_tds <int>, interceptions <dbl>, sacks <dbl>, sack_yards <dbl>,
## # sack_fumbles <int>, sack_fumbles_lost <int>, passing_air_yards <dbl>,
## # passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## # passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## # carries <int>, rushing_yards <dbl>, rushing_tds <int>, ...
Later in this analysis, we will find out that many of the explanatory variables we will try to use are highly correlated. Because of this, I would like to introduce player age, height, and weight to the data set as potential explanatory variables. We can obtain age/height/weight information using the fast_scraper_roster() function from the nflfastR package. I will use the age_calc() function to calculate the players age as of the end of the 2021 NFL season, which was on January 9, 2022. The heights and weights are loaded in as character data, so I will convert them to numeric.
RB_age_ht_wt <- fast_scraper_roster(2021) %>%
filter(position == "RB", full_name %in% rbstats_2020$full_name) %>%
select(full_name, birth_date, height, weight) %>%
drop_na(birth_date) %>%
mutate(age = age_calc(birth_date, enddate = as.Date("2022-01-09"), units = "years", precise = TRUE),
height = as.numeric(height),
weight = as.numeric(weight))
head(RB_age_ht_wt)
## # A tibble: 6 x 5
## full_name birth_date height weight age
## <chr> <date> <dbl> <dbl> <dbl>
## 1 Chase Edmonds 1996-04-13 69 210 25.7
## 2 James Conner 1995-05-05 73 233 26.7
## 3 Mike Davis 1993-02-19 69 220 28.9
## 4 Gus Edwards 1995-04-13 73 238 26.7
## 5 J.K. Dobbins 1998-12-17 70 214 23.1
## 6 Latavius Murray 1990-01-18 75 230 32.0
WR_age_ht_wt <- fast_scraper_roster(2021) %>%
filter(position == "WR", full_name %in% wrstats_2020$full_name) %>%
select(full_name, birth_date, height, weight) %>%
drop_na(birth_date) %>%
mutate(age = age_calc(birth_date, enddate = as.Date("2022-01-09"), units = "years", precise = TRUE),
height = as.numeric(height),
weight = as.numeric(weight))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
head(WR_age_ht_wt)
## # A tibble: 6 x 5
## full_name birth_date height weight age
## <chr> <date> <dbl> <dbl> <dbl>
## 1 A.J. Green 1988-07-31 76 207 33.4
## 2 Andy Isabella 1996-11-18 69 188 25.1
## 3 DeAndre Hopkins 1992-06-06 73 212 29.6
## 4 Christian Kirk 1996-11-18 71 200 25.1
## 5 Calvin Ridley 1994-12-20 73 190 27.1
## 6 Russell Gage 1996-01-22 72 184 26.0
I will now join the age, height, and weight information to my positional data frames.
rbstats_2020 <- rbstats_2020 %>%
left_join(RB_age_ht_wt, by = "full_name") %>%
drop_na(fantasy_PPG_21)
wrstats_2020 <- wrstats_2020 %>%
left_join(WR_age_ht_wt, by = "full_name") %>%
drop_na(fantasy_PPG_21)
rbstats_2020 <- rbstats_2020 %>%
drop_na(age)
wrstats_2020 <- wrstats_2020 %>%
drop_na(age)
The nflfastR package has some nice capabilities when it comes to making plots. Earlier, I joined the teams_colors_logos table to my data frame. This table will allow us to create plots with team logos or player headshots. I will now create some exploratory plots using 2021 fantasy points per game as the y variable with various 2020 metrics as the x value.
Naturally, the first thing we will want to ask ourself is whether or not fantasy points per game (FPPG) are consistent from year to year. That is, are a gvien years FPPG useful in predicting the following years FPPG?
#team logo example
ggplot(rbstats_2020, aes(x = fantasy_PPG_last_yr, y = fantasy_PPG_21)) +
geom_image(aes(image = team_logo_espn), asp = 16/9, size = 0.05) +
labs(x = "2020 Fantasy Points per Game", y = "2021 Fantasy Points per Game", caption = "Data: nflfastR", title = "Fantasy Points per Game: 2021 vs. 2020") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
#player headshot example
ggplot(rbstats_2020, aes(x = fantasy_PPG_last_yr, y = fantasy_PPG_21)) +
geom_nfl_headshots(aes(player_gsis = player_id, height = 0.1)) +
labs(x = "2020 Fantasy Points per Game", y = "2021 Fantasy Points per Game", caption = "Data: nflfastR", title = "RB Fantasy Points per Game: 2021 vs. 2020") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
Now that I’ve shown how to create plots with team logos or headshots, I will switch back to normal plotting in order to save time. The logo/heashot plots tend to take a while to run. Let’s see if touchdowns per game in 2020 seem related to fantasy points in 2020. As I go, I will create per game metrics as needed.
rbstats_2020$rushing_td_PG_20 <- rbstats_2020$rushing_tds / rbstats_2020$games
rbstats_2020$receiving_td_PG_20 <- rbstats_2020$receiving_tds / rbstats_2020$games
wrstats_2020$receiving_td_PG_20 <- wrstats_2020$receiving_tds / wrstats_2020$games
ggplot(rbstats_2020, aes(x = rushing_td_PG_20, y = fantasy_PPG_21)) +
geom_point() +
labs(x = "2020 Rushing TD per Game", y = "2021 Fantasy Points per Game", caption = "Data: nflfastR", title = "Running Backs: 2021 Fantasy PPG vs. 2020 Rushing TD per Game") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
ggplot(rbstats_2020, aes(x = receiving_td_PG_20, y = fantasy_PPG_21)) +
geom_point() +
labs(x = "2020 Receiving TD per Game", y = "2021 Fantasy Points per Game", caption = "Data: nflfastR", title = "Running Backs: 2021 Fantasy PPG vs. 2020 Receiving TD per Game") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
Neither one of these plots have a very strong trend, but rushing TD per game seems to have more of a relationship with 2021 FPPG than receiving TD per game. Let’s take a look at a plot visualizing 2020 FPPG vs. receiving TD per game for WRs.
ggplot(wrstats_2020, aes(x = receiving_td_PG_20, y = fantasy_PPG_21)) +
geom_point() +
labs(x = "2020 Receiving TD per Game", y = "2021 Fantasy PPG", title = "Wide Receivers: 2021 Fantasy PPG vs 2020 Receiving TD per Game", caption = "Data:nflfastR")+
theme(plot.title = element_text(face = "bold", hjust = 0.5))
Again, not a very strong trend. Perhaps we shouldn’t worry too much about how many touchdowns a player scored in a given year when projecting that player in fantasy football for the following season. Let’s take a look at carries per game for running backs and receptions per game for both running backs and wide receivers.
rbstats_2020$carriesPG_20 <- rbstats_2020$carries / rbstats_2020$games
rbstats_2020$receptionsPG_20 <- rbstats_2020$receptions / rbstats_2020$games
wrstats_2020$receptionsPG_20 <- wrstats_2020$receptions / wrstats_2020$games
ggplot(rbstats_2020) +
geom_point(aes(x = carriesPG_20, y = fantasy_PPG_21)) +
labs(x = "2020 Carries per Game", y = "2021 Fantasy PPG", caption = "Data: nflfastR", title = "Running Backs: 2021 FPPG vs. 2020 Carries per Game") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
ggplot(rbstats_2020) +
geom_point(aes(x = receptionsPG_20, y = fantasy_PPG_21)) +
labs(x = "2020 Receptions per Game", y = "2021 Fantasy PPG", caption = "Data: nflfastR", title = "Running Backs: 2021 FPPG vs. 2020 Receptions per Game") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
Not the strongest of trends, but there appears to be more of a relationship between these metrics and 2021 FPPG than there was with the TD counts. Let’s take a look at the relationship between 2020 receptions per game and 2021 FPPG for the WR position.
ggplot(wrstats_2020, aes(x = receptionsPG_20, y = fantasy_PPG_21)) +
geom_point() +
labs(x = "2020 Receptions per Game", y = "2021 Fantasy PPG", caption = "Data: nflfastR", title = "Wide Receivers: 2021 FPPG vs. 2020 Receptions per Game") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
There appears to be a positive trend between 2020 receptions per game and 2021 FPPG for the WR position. So far, we can gather that we will likely want to include carries/receptions per game rather than touchdowns per game from 2020 when predicting 2021 FPPG. Let’s take a look at some plots relating age/height/weight to FPPG for both positions.
ggplot(rbstats_2020) +
geom_point(aes(x = age, y = fantasy_PPG_21)) +
labs(x = "Age", y = "2021 Fantasy PPG", title = "Running Backs: 2021 Fantasy PPG by Age", caption = "Data: nflfastR | Age as of the end of the 2021 NFL season") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
ggplot(rbstats_2020) +
geom_point(aes(x = weight, y = fantasy_PPG_21)) +
labs(x = "Weight", y = "2021 Fantasy PPG", title = "Running Backs: 2021 Fantasy PPG by Weight", caption = "Data: nflfastR") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
ggplot(rbstats_2020) +
geom_point(aes(x = height, y = fantasy_PPG_21)) +
labs(x = "Height", y = "2021 Fantasy PPG", title = "Running Backs: 2021 Fantasy PPG by Height", caption = "Data: nflfastR") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
Not much of a trend for any of age/height/weight for the RB position. Perhaps a slightly positive trend with height/weight. Let’s take a look at the same explanatory variables for the WR position.
ggplot(wrstats_2020) +
geom_point(aes(x = age, y = fantasy_PPG_21)) +
labs(x = "Age", y = "2021 Fantasy PPG", title = "Wide Receivers: 2021 Fantasy PPG by Age", caption = "Data: nflfastR | Age as of the end of the 2021 NFL season") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
ggplot(wrstats_2020) +
geom_point(aes(x = weight, y = fantasy_PPG_21)) +
labs(x = "Weight", y = "2021 Fantasy PPG", title = "Wide Receivers: 2021 Fantasy PPG by Weight", caption = "Data: nflfastR") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
ggplot(wrstats_2020) +
geom_point(aes(x = height, y = fantasy_PPG_21)) +
labs(x = "Height", y = "2021 Fantasy PPG", title = "Wide Receivers: 2021 Fantasy PPG by Height", caption = "Data: nflfastR") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
Again, no obvious trends. Regardless, we will try some models that include these metrics to see how they do. We will take a look at a few more per-game metrics before moving into the modeling phase. Let’s take a look at rushing yards per game for RBs and receiving yards per game for both RBs and WRs.
rbstats_2020$rushing_yards_PG_20 <- rbstats_2020$rushing_yards / rbstats_2020$games
rbstats_2020$receiving_yards_PG_20 <- rbstats_2020$receiving_yards / rbstats_2020$games
wrstats_2020$receiving_yards_PG_20 <- wrstats_2020$receiving_yards / wrstats_2020$games
ggplot(rbstats_2020) +
geom_point(aes(x = rushing_yards_PG_20, y = fantasy_PPG_21)) +
labs(x = "2020 Rushing Yards per Game", y = "2021 Fantasy Points per Game", title = "Running Backs: 2021 Fantasy PPG by 2020 Rushing Yards per Game", caption = "Data: nflfastR") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
ggplot(rbstats_2020) +
geom_point(aes(x = receiving_yards_PG_20, y = fantasy_PPG_21)) +
labs(x = "2020 Receiving Yards per Game", y = "2021 Fantasy Points per Game", title = "Running Backs: 2021 Fantasy PPG by 2020 Receiving Yards per Game", caption = "Data: nflfastR") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
There appears to be a positive trend between these two explanatory variables and 2021 FPPG. The concern will be how correlated this explanatory variable is with other potential explanatory variables, which might lead to multicollinearity, causing a more difficult interpretation of the model.
ggplot(wrstats_2020) +
geom_point(aes(x = receiving_yards_PG_20, y = fantasy_PPG_21)) +
labs(x = "2020 Receiving Yards per Game", y = "2021 Fantasy Points per Game", title = "Wide Receivers: 2021 Fantasy PPG by 2020 Receiving Yards per Game", caption = "Data: nflfastR") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
There also appears to be a positive trend between 2020 receiving yards per game and 2021 FPPG for WRs. Two more variables we will take a look at for the WR positon are air yards share and target share.
ggplot(wrstats_2020, aes(x = air_yards_share, y = fantasy_PPG_21)) +
geom_point() +
labs(x = "2020 Air Yards Share", y = "2021 Fantasy Points per Game", title = "Wide Receivers: 2021 Fantasy PPG vs. 2020 Air Yards Share", caption = "Data: nflfastR")+
theme(plot.title = element_text(face = "bold", hjust = 0.5))
ggplot(wrstats_2020, aes(x = target_share, y = fantasy_PPG_21)) +
geom_point() +
labs(x = "2020 Target Share", y = "2021 Fantasy Points per Game", title = "Wide Receivers: 2021 Fantasy PPG vs. 2020 Target Share", caption = "Data: nflfastR")+
theme(plot.title = element_text(face = "bold", hjust = 0.5))
These two variables (mainly target share) appear to have a positive relationship with 2021 FPPG for WRs. We will definitely attempt some models that include these variables.
Now that we have taken a look at some exploratory plots, lets try some models. I may make some more per game metrics as I move forward.
For the null model, we will predict the average FPPG for every RB. Hopefully we will be able to improve from the null model by adding explanatory variables.
RB_model0 <- lm(fantasy_PPG_21 ~ 1, data = rbstats_2020)
summary(RB_model0)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ 1, data = rbstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.506 -4.420 0.184 3.536 12.649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.5137 0.7874 14.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.34 on 45 degrees of freedom
RB_MAE0 <- MAE(rbstats_2020$fantasy_PPG_21, RB_model0$fit)
RB_MAE0
## [1] 4.306429
So, if we predict the average 2021 FPPG for every RB, our mean absolute error is 4.306. That is, on average, our prediction is 4.306 points away from the true FPPG. Hopefully, we will be able to improve this by adding in some explanatory variable. The variable that seemed to show the strongest trend with 2021 FPPG was 2020 FPPG. Let’s see how a model does with this variable as the lone explanatory variable.
RB_model1 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr, data = rbstats_2020)
summary(RB_model1)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr, data = rbstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.562 -2.397 0.006 2.241 7.502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.79720 1.25526 3.025 0.00414 **
## fantasy_PPG_last_yr 0.63254 0.09234 6.850 1.89e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.757 on 44 degrees of freedom
## Multiple R-squared: 0.5161, Adjusted R-squared: 0.5051
## F-statistic: 46.93 on 1 and 44 DF, p-value: 1.895e-08
MAE_RB1 <- MAE(rbstats_2020$fantasy_PPG_21, RB_model1$fit)
MAE_RB1
## [1] 2.895897
After adding 2020 FPPG as an explanatory variable, we already see a large improvement in our model. We went from an MAE of 4.3 in the null model to under 3 in this model. Let’s see if we can improve by adding some more explanatory variables. Let’s add 2020 carries per game and receptions per game, both of which seemed to have a positive relationship with 2021 FPPG.
RB_model2 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + carriesPG_20 + receptionsPG_20, data = rbstats_2020)
summary(RB_model2)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + carriesPG_20 +
## receptionsPG_20, data = rbstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.908 -2.192 -0.112 2.240 7.878
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5615 1.5596 2.284 0.0275 *
## fantasy_PPG_last_yr 0.5537 0.3014 1.837 0.0733 .
## carriesPG_20 0.1381 0.2609 0.529 0.5994
## receptionsPG_20 -0.1298 0.7893 -0.164 0.8702
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.811 on 42 degrees of freedom
## Multiple R-squared: 0.5246, Adjusted R-squared: 0.4907
## F-statistic: 15.45 on 3 and 42 DF, p-value: 6.421e-07
MAE_RB2 <- MAE(rbstats_2020$fantasy_PPG_21, RB_model2$fit)
MAE_RB2
## [1] 2.846369
We see a slight dip in adjusted \(R^2\), but we also see a slightly lower MAE. So, this model is comparable to the last model, it is not notably better or worse. Let’s see if 2020 rushing yards or receiving yards per game help improve the model.
RB_model3 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + carriesPG_20 + receptionsPG_20 + rushing_yards_PG_20 + receiving_yards_PG_20, data = rbstats_2020)
summary(RB_model3)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + carriesPG_20 +
## receptionsPG_20 + rushing_yards_PG_20 + receiving_yards_PG_20,
## data = rbstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2675 -2.0723 -0.1874 2.3028 7.8808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.38804 1.58119 2.143 0.0383 *
## fantasy_PPG_last_yr 0.39509 0.38525 1.026 0.3113
## carriesPG_20 -0.27476 0.40360 -0.681 0.4999
## receptionsPG_20 1.09550 1.81812 0.603 0.5502
## rushing_yards_PG_20 0.10766 0.08811 1.222 0.2289
## receiving_yards_PG_20 -0.08836 0.23084 -0.383 0.7039
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.82 on 40 degrees of freedom
## Multiple R-squared: 0.5452, Adjusted R-squared: 0.4884
## F-statistic: 9.592 on 5 and 40 DF, p-value: 4.503e-06
MAE_RB3 <- MAE(rbstats_2020$fantasy_PPG_21, RB_model3$fit)
MAE_RB3
## [1] 2.797849
One thing that you might notice from the last model to this one is that the sign of the coefficient on the carriesPG_20 variable switched signs. We are likely starting to see some multicollinearity within our model, which happens when explanatory variables are highly correlated. This would make sense, as we would expect carries and yards to be highly correlated; the RBs that get more carries will pile up more yards. We can use this same idea with receptions and receiving yards. Let’s see if adding in age, height, and weight improves the model at all.
RB_model4 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + carriesPG_20 + receptionsPG_20 + rushing_yards_PG_20 + receiving_yards_PG_20 + age + height + weight, data = rbstats_2020)
summary(RB_model4)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + carriesPG_20 +
## receptionsPG_20 + rushing_yards_PG_20 + receiving_yards_PG_20 +
## age + height + weight, data = rbstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6328 -2.5294 -0.2105 2.5704 6.8388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.82800 25.63660 -0.344 0.7325
## fantasy_PPG_last_yr 0.21840 0.39130 0.558 0.5801
## carriesPG_20 -0.28608 0.40996 -0.698 0.4897
## receptionsPG_20 1.37543 1.80083 0.764 0.4498
## rushing_yards_PG_20 0.11964 0.08908 1.343 0.1874
## receiving_yards_PG_20 -0.03051 0.23592 -0.129 0.8978
## age 0.19833 0.35501 0.559 0.5798
## height -0.32616 0.42305 -0.771 0.4456
## weight 0.14096 0.06673 2.112 0.0415 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.711 on 37 degrees of freedom
## Multiple R-squared: 0.603, Adjusted R-squared: 0.5171
## F-statistic: 7.024 on 8 and 37 DF, p-value: 1.317e-05
MAE_RB4 <- MAE(rbstats_2020$fantasy_PPG_21, RB_model4$fit)
MAE_RB4
## [1] 2.73875
This model has our lowest MAE yet. Let’s try to drop some of the variables that might be causing multicollinearity issues and see if we can maintain a similar MAE.
RB_model5 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + carriesPG_20 + receptionsPG_20 + age + height + weight, data = rbstats_2020)
summary(RB_model5)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + carriesPG_20 +
## receptionsPG_20 + age + height + weight, data = rbstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1621 -2.4791 -0.2117 2.5034 7.9893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.48710 24.33440 -0.184 0.8547
## fantasy_PPG_last_yr 0.43847 0.29852 1.469 0.1499
## carriesPG_20 0.15801 0.26187 0.603 0.5497
## receptionsPG_20 0.42955 0.83630 0.514 0.6104
## age 0.27974 0.34699 0.806 0.4250
## height -0.39404 0.40763 -0.967 0.3397
## weight 0.13356 0.06599 2.024 0.0498 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.71 on 39 degrees of freedom
## Multiple R-squared: 0.5817, Adjusted R-squared: 0.5174
## F-statistic: 9.039 on 6 and 39 DF, p-value: 3.325e-06
MAE_RB5 <- MAE(rbstats_2020$fantasy_PPG_21, RB_model5$fit)
MAE_RB5
## [1] 2.762015
So, we were able to maintain a similarly effective model while also getting rid of some confusing terms. However, carries and receptions from 2020 might also be correlated with 2020 FPPG. Let’s see how the model does if we drop those terms.
RB_model6 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + age + height + weight, data = rbstats_2020)
summary(RB_model6)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + age + height +
## weight, data = rbstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1258 -2.7274 -0.0569 2.4853 8.5221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.55544 22.69050 -0.245 0.8078
## fantasy_PPG_last_yr 0.61946 0.09104 6.805 3.11e-08 ***
## age 0.26952 0.32308 0.834 0.4090
## height -0.34917 0.38994 -0.895 0.3758
## weight 0.12770 0.05733 2.227 0.0315 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.637 on 41 degrees of freedom
## Multiple R-squared: 0.5773, Adjusted R-squared: 0.5361
## F-statistic: 14 on 4 and 41 DF, p-value: 2.766e-07
MAE_RB6 <- MAE(rbstats_2020$fantasy_PPG_21, RB_model6$fit)
MAE_RB6
## [1] 2.750365
This is actually our best performing model yet. It isn’t an outstanding model with an adjusted \(R^2\) of 0.5361 and an MAE of 2.75, but it is an improvement from where we started. On average, we are predicting 2.75 points off the true 2021 FPPG for each player. Before moving on to WR, let’s take a look at the actual vs. predicted plot for our best model.
best_RB_model <- data.frame(rbstats_2020$fantasy_PPG_21, predict(RB_model6))
names(best_RB_model) <- c("Actual", "Predicted")
ggplot(best_RB_model) +
geom_point(aes(x = Actual, y = Predicted)) +
geom_abline() +
labs(y = "Predicted Values", x = "Actual Values", title = "Actual Fantasy PPG vs. Predicted Fantasy PPG (RBs)") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
ggplot(data=RB_model6, aes(x = .fitted, y = .resid))+
geom_point() +
ggtitle("Checking for Constant Variance") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
Fortunately, our actual vs. predicted plot as well as our constant variance plot look pretty good. It would be nice to have the actual vs. predicted more tightly scattered around \(y=x\), but overall the model does somewhat well. Before moving on, I’d like to create a table that summarizes the findings of my model. The idea is that this table format would be somewhat easy to interpret for someone who is not familiar with regression. I am going to filter to include only the top 24 RBs in 2021 fantasy points per game and include a table of their name, age, height, and weight, their PPG in 2021, as well as their predicted 2021 FPPG and the residual. This will give the audience an idea of where the model did well and where it did not do well. Given that I am very familiar with the players and the things that happened during the 2021 season, a summary table like this where I can see which players the model didn’t do very well on might also give me some inspiration on more independent variables to include in the future. I will also demonstrate how to use conditional formatting in a gt table, which I will apply to the residual column.
#install.packages("remotes")
#remotes::install_github("jthomasmock/gtExtras")
rbstats_2020$predicted_fantasy_PPG_21 <- predict(RB_model6)
rbstats_2020$residual <- rbstats_2020$fantasy_PPG_21 - rbstats_2020$predicted_fantasy_PPG_21
rbstats_2020$PPG_21_rank <- rank(-rbstats_2020$fantasy_PPG_21)
RB_summary_table <- rbstats_2020 %>%
filter(PPG_21_rank <= 24) %>%
select(full_name, team_logo_espn, age, height, weight, fantasy_PPG_last_yr, fantasy_PPG_21, predicted_fantasy_PPG_21, residual, PPG_21_rank) %>%
arrange(-fantasy_PPG_21) %>%
gt() %>%
gtExtras::gt_img_rows(team_logo_espn) %>%
cols_label(full_name = "Player", team_logo_espn = "Team", age = "Age", height = "Height (inches)", weight = "Weight (pounds)", fantasy_PPG_last_yr = "2020 FPPG", fantasy_PPG_21 = "2021 FPPG", predicted_fantasy_PPG_21 = "Predicted 2021 FPPG", residual = "Residual", PPG_21_rank = "2021 FPPG Ranking") %>%
fmt_number(c(age, fantasy_PPG_last_yr, fantasy_PPG_21, predicted_fantasy_PPG_21),
decimals = 2) %>%
tab_header(title = "RB Model Summary", subtitle = "Table includes top 24 RBs (non-rookies) in 2021 FPPG (PPR) | Data: nflfastR")
#Applying conditional formatting to residual column
RB_summary_table %>%
data_color(
columns = c(residual),
colors = scales::col_numeric(
c("#f87274", "#FFFFFF", "#f87274"),
domain = c(-9, 9)
)
) # %>%
| RB Model Summary | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| Table includes top 24 RBs (non-rookies) in 2021 FPPG (PPR) | Data: nflfastR | |||||||||
| Player | Team | Age | Height (inches) | Weight (pounds) | 2020 FPPG | 2021 FPPG | Predicted 2021 FPPG | Residual | 2021 FPPG Ranking |
| Derrick Henry | 28.01 | 75 | 247 | 20.82 | 24.16 | 20.25 | 3.9165223 | 1 | |
| Jonathan Taylor | 22.97 | 70 | 226 | 16.85 | 21.95 | 15.49 | 6.4520778 | 2 | |
| Austin Ekeler | 26.65 | 70 | 200 | 16.53 | 21.49 | 12.97 | 8.5221305 | 3 | |
| Leonard Fournette | 26.98 | 72 | 228 | 11.00 | 18.26 | 12.50 | 5.7521977 | 4 | |
| Christian McCaffrey | 25.59 | 71 | 205 | 30.13 | 18.21 | 21.40 | -3.1824609 | 5 | |
| Alvin Kamara | 26.46 | 70 | 215 | 25.19 | 18.05 | 20.19 | -2.1385822 | 6 | |
| Joe Mixon | 25.46 | 73 | 220 | 16.60 | 17.99 | 14.20 | 3.7977979 | 7 | |
| D'Andre Swift | 22.99 | 69 | 211 | 14.60 | 16.07 | 12.54 | 3.5327777 | 8 | |
| Dalvin Cook | 26.42 | 70 | 210 | 24.13 | 15.87 | 18.89 | -3.0174216 | 9 | |
| Nick Chubb | 26.04 | 71 | 227 | 17.31 | 15.38 | 16.38 | -1.0023375 | 10 | |
| Aaron Jones | 27.10 | 69 | 208 | 18.49 | 15.27 | 15.67 | -0.4079820 | 11 | |
| Josh Jacobs | 23.91 | 70 | 220 | 15.42 | 15.07 | 14.09 | 0.9732595 | 12 | |
| David Montgomery | 24.59 | 71 | 224 | 17.65 | 15.00 | 15.82 | -0.8223751 | 13 | |
| Ezekiel Elliott | 26.47 | 72 | 228 | 14.91 | 14.83 | 14.79 | 0.0345592 | 14 | |
| Antonio Gibson | 23.55 | 74 | 220 | 14.44 | 14.32 | 11.99 | 2.3247933 | 15 | |
| Damien Harris | 24.91 | 71 | 213 | 9.13 | 14.01 | 9.22 | 4.7832407 | 16 | |
| Kareem Hunt | 26.43 | 71 | 216 | 13.66 | 13.75 | 12.82 | 0.9305506 | 17 | |
| Darrell Henderson | 24.39 | 68 | 208 | 8.66 | 13.62 | 9.20 | 4.4149502 | 18 | |
| James Robinson | 23.42 | 69 | 219 | 17.89 | 12.42 | 15.71 | -3.2886861 | 19 | |
| Melvin Gordon | 28.74 | 73 | 215 | 13.23 | 12.19 | 12.35 | -0.1575022 | 20 | |
| Chris Carson | 27.32 | 71 | 222 | 15.65 | 12.03 | 15.06 | -3.0349596 | 21 | |
| Chase Edmonds | 25.74 | 69 | 210 | 10.50 | 11.94 | 10.61 | 1.3296646 | 22 | |
| Clyde Edwards-Helaire | 22.75 | 68 | 209 | 13.54 | 11.76 | 11.91 | -0.1483940 | 23 | |
| Devin Singletary | 24.35 | 67 | 203 | 8.97 | 11.64 | 9.10 | 2.5388658 | 24 | |
#gtsave("RB_summary.png")
Based on the summary table, it looks like the model does decently well on the players that arent on the very high end. We have some pretty high residuals in the top 5. For future analyses, I would like to consider independent variables such as salary and what round a player was drafted in during the NFL draft. I think these might improve the performance of the model. I would consider this model as more of a starting point in the area of predicting FPPG rather than something I would rely on when drafting a fantasy football team. RB is a high variance position from year to year and I would like to continue looking into better models in the future. Finally, to wrap up the RB portion of this project, I will create a table that displays predictions for 2022 based on the model I created.
#Getting 2021 variables used in model
rb_predictions_22 <- RB_names21 %>%
filter(nameteam %in% stats_2021$nameteam) %>%
left_join(stats_2021, by = "nameteam") %>%
filter(carries > 20) %>%
arrange(-fantasy_points_ppr_21) %>%
mutate(fantasy_PPG_21 = fantasy_points_ppr_21 / games)
#New age/height/weight df with ages as of the end of the 2022 season
RB_age_ht_wt_22 <- fast_scraper_roster(2021) %>%
filter(position == "RB", full_name %in% rb_predictions_22$full_name) %>%
select(full_name, birth_date, height, weight) %>%
drop_na(birth_date) %>%
mutate(age = age_calc(birth_date, enddate = as.Date("2023-01-08"), units = "years", precise = TRUE),
height = as.numeric(height),
weight = as.numeric(weight))
#Join age/height/weight to predictions DF
rb_predictions_22 <- rb_predictions_22 %>%
left_join(RB_age_ht_wt_22, by = "full_name") %>%
drop_na(fantasy_PPG_21) %>%
rename(fantasy_PPG_last_yr = fantasy_PPG_21)
rb_predictions_22$predicted_22_FPPG <- predict(RB_model6, newdata = rb_predictions_22)
rb_predictions_22 <- rb_predictions_22 %>%
arrange(-predicted_22_FPPG)
rb_predictions_22 %>%
mutate(predicted_FPPG_rank = rank(-predicted_22_FPPG)) %>%
filter(predicted_FPPG_rank <= 24) %>%
select(full_name, team_logo_espn, age, height, weight, fantasy_PPG_last_yr, predicted_22_FPPG, predicted_FPPG_rank) %>%
gt() %>%
gtExtras::gt_img_rows(team_logo_espn) %>%
cols_label(full_name = "Player", team_logo_espn = "Team", age = "Age", height = "Height (inches)", weight = "Weight (pounds)", fantasy_PPG_last_yr = "2021 FPPG", predicted_22_FPPG = "Predicted 2022 FPPG", predicted_FPPG_rank = "Predicted FPPG Ranking") %>%
fmt_number(c(age, height, weight, fantasy_PPG_last_yr, predicted_22_FPPG),
decimals = 2) %>%
tab_header(title = "2022 RB Prediction Summary", subtitle = "Data: nflfastR")
| 2022 RB Prediction Summary | |||||||
|---|---|---|---|---|---|---|---|
| Data: nflfastR | |||||||
| Player | Team | Age | Height (inches) | Weight (pounds) | 2021 FPPG | Predicted 2022 FPPG | Predicted FPPG Ranking |
| Derrick Henry | 29.01 | 75.00 | 247.00 | 24.16 | 22.59 | 1 | |
| Jonathan Taylor | 23.97 | 70.00 | 226.00 | 21.95 | 18.92 | 2 | |
| Leonard Fournette | 27.97 | 72.00 | 228.00 | 18.26 | 17.27 | 3 | |
| James Conner | 27.68 | 73.00 | 233.00 | 17.18 | 16.81 | 4 | |
| Austin Ekeler | 27.65 | 70.00 | 200.00 | 21.49 | 16.31 | 5 | |
| Najee Harris | 24.84 | 73.00 | 232.00 | 17.69 | 16.23 | 6 | |
| Alvin Kamara | 27.46 | 70.00 | 215.00 | 18.05 | 16.04 | 7 | |
| Nick Chubb | 27.03 | 71.00 | 227.00 | 15.38 | 15.45 | 8 | |
| Joe Mixon | 26.46 | 73.00 | 220.00 | 17.99 | 15.33 | 9 | |
| Ezekiel Elliott | 27.47 | 72.00 | 228.00 | 14.83 | 15.01 | 10 | |
| David Montgomery | 25.59 | 71.00 | 224.00 | 15.00 | 14.45 | 11 | |
| Cordarrelle Patterson | 31.81 | 74.00 | 220.00 | 14.66 | 14.36 | 12 | |
| Christian McCaffrey | 26.59 | 71.00 | 205.00 | 18.21 | 14.28 | 13 | |
| AJ Dillon | 24.69 | 72.00 | 247.00 | 10.92 | 14.26 | 14 | |
| Josh Jacobs | 24.91 | 70.00 | 220.00 | 15.07 | 14.14 | 15 | |
| Dalvin Cook | 27.41 | 70.00 | 210.00 | 15.87 | 14.04 | 16 | |
| Aaron Jones | 28.10 | 69.00 | 208.00 | 15.27 | 13.94 | 17 | |
| D'Andre Swift | 23.98 | 69.00 | 211.00 | 16.07 | 13.72 | 18 | |
| Saquon Barkley | 25.91 | 71.00 | 233.00 | 11.43 | 13.47 | 19 | |
| Kareem Hunt | 27.42 | 71.00 | 216.00 | 13.75 | 13.15 | 20 | |
| Chris Carson | 28.31 | 71.00 | 222.00 | 12.03 | 13.08 | 21 | |
| D'Onta Foreman | 26.71 | 73.00 | 236.00 | 10.43 | 12.75 | 22 | |
| James Robinson | 24.42 | 69.00 | 219.00 | 12.42 | 12.59 | 23 | |
| Rashaad Penny | 26.93 | 71.00 | 220.00 | 12.17 | 12.55 | 24 | |
Now that we made a few models and drew conclusions for the RB position, I will start making some models for the WR position. Let’s start out with a null model again as a starting point. The null model will predict the average FPPG for every WR in 2021.
WR_model0 <- lm(fantasy_PPG_21 ~ 1, data = wrstats_2020)
summary(WR_model0)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ 1, data = wrstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.766 -3.585 -0.224 3.050 13.701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.1523 0.5996 20.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.566 on 57 degrees of freedom
WR_MAE0 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model0$fit)
WR_MAE0
## [1] 3.688751
The WR position is more consistent than the RB position, so it makes sense that the null model does better here than the RB null model. Let’s add some independent variables to see how much we can improve. I will go through the same progression as I did with the RB models. First, we will see how much improvement we get by adding in 2020 FPPG.
WR_model1 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr, data = wrstats_2020)
summary(WR_model1)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr, data = wrstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4977 -1.6883 -0.5607 1.4256 12.8069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0222 1.5091 2.003 0.0501 .
## fantasy_PPG_last_yr 0.7136 0.1123 6.354 4.01e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.512 on 56 degrees of freedom
## Multiple R-squared: 0.4189, Adjusted R-squared: 0.4085
## F-statistic: 40.37 on 1 and 56 DF, p-value: 4.013e-08
WR_MAE1 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model1$fit)
WR_MAE1
## [1] 2.447401
By adding in 2020 FPPG as an independent variable, our MAE dropped all the way down to ~2.5. This is already lower than our lowest MAE RB model. Let’s see how low we can get our MAE for the WR position while maintaining an explainable model.
WR_model2 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + receptionsPG_20, data = wrstats_2020)
summary(WR_model2)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + receptionsPG_20,
## data = wrstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2294 -1.8281 -0.3864 1.1867 11.5031
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6751 1.5100 1.772 0.082 .
## fantasy_PPG_last_yr 0.3656 0.2565 1.426 0.160
## receptionsPG_20 1.0661 0.7083 1.505 0.138
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.473 on 55 degrees of freedom
## Multiple R-squared: 0.4419, Adjusted R-squared: 0.4216
## F-statistic: 21.78 on 2 and 55 DF, p-value: 1.082e-07
MAE_WR2 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model2$fit)
MAE_WR2
## [1] 2.426667
Not much improvement from the previous model when we add a receptions term to the model. Let’s try a few more models.
WR_model3 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + receptionsPG_20 + receiving_yards_PG_20, data = wrstats_2020)
summary(WR_model3)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + receptionsPG_20 +
## receiving_yards_PG_20, data = wrstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0656 -1.6473 -0.4170 0.8501 11.4920
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.87250 1.59010 1.806 0.0764 .
## fantasy_PPG_last_yr 0.48484 0.38041 1.275 0.2079
## receptionsPG_20 1.14686 0.73825 1.553 0.1261
## receiving_yards_PG_20 -0.03628 0.08498 -0.427 0.6711
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.499 on 54 degrees of freedom
## Multiple R-squared: 0.4438, Adjusted R-squared: 0.4129
## F-statistic: 14.36 on 3 and 54 DF, p-value: 5.352e-07
MAE_WR3 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model3$fit)
MAE_WR3
## [1] 2.396826
Similar to what we experienced with the RB models, we are starting to see some multicollinearty present in the model (negative coefficient on receving yards is the opposite of what we would expect).
WR_model4 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + receptionsPG_20 + receiving_yards_PG_20 + age + height + weight, data = wrstats_2020)
summary(WR_model4)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + receptionsPG_20 +
## receiving_yards_PG_20 + age + height + weight, data = wrstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0667 -1.6010 -0.2347 1.5335 11.2675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.813706 19.183292 -0.042 0.9663
## fantasy_PPG_last_yr 0.596657 0.385646 1.547 0.1280
## receptionsPG_20 1.820403 0.790075 2.304 0.0253 *
## receiving_yards_PG_20 -0.096128 0.089737 -1.071 0.2891
## age -0.410109 0.216326 -1.896 0.0637 .
## height 0.173058 0.332530 0.520 0.6050
## weight 0.004833 0.046561 0.104 0.9177
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.441 on 51 degrees of freedom
## Multiple R-squared: 0.4918, Adjusted R-squared: 0.432
## F-statistic: 8.225 on 6 and 51 DF, p-value: 3.043e-06
MAE_WR4 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model4$fit)
MAE_WR4
## [1] 2.240891
A slight improvement on MAE, but nothing drastic. Like we did with the RB model, I will now drop the terms that appear to be highly correlated and see what the result is.
WR_model5 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + age + height + weight, data = wrstats_2020)
summary(WR_model5)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + age + height +
## weight, data = wrstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3609 -2.0131 -0.4626 1.3544 13.1240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.747e+00 1.943e+01 0.141 0.888
## fantasy_PPG_last_yr 7.691e-01 1.228e-01 6.264 6.85e-08 ***
## age -2.483e-01 2.108e-01 -1.178 0.244
## height 8.464e-02 3.307e-01 0.256 0.799
## weight 4.088e-05 4.726e-02 0.001 0.999
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.55 on 53 degrees of freedom
## Multiple R-squared: 0.4379, Adjusted R-squared: 0.3955
## F-statistic: 10.32 on 4 and 53 DF, p-value: 2.957e-06
MAE_WR5 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model5$fit)
MAE_WR5
## [1] 2.395658
There are a few more variables I wanted to check out before we move into summarizing our findings. One of those is air yards, which I described at the beginning of this project. I am going to take a look at air yards per game as well as air yards share for the entire season.
wrstats_2020$receiving_air_yards_PG_last_year <- wrstats_2020$receiving_air_yards / wrstats_2020$games
WR_model6 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + receiving_air_yards_PG_last_year + age + height + weight, data = wrstats_2020)
summary(WR_model6)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + receiving_air_yards_PG_last_year +
## age + height + weight, data = wrstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0785 -1.7348 -0.1614 1.3743 10.8820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.50057 17.76509 -0.535 0.595076
## fantasy_PPG_last_yr 1.21590 0.16367 7.429 1.02e-09 ***
## receiving_air_yards_PG_last_year -0.10441 0.02825 -3.696 0.000528 ***
## age -0.43065 0.19570 -2.201 0.032230 *
## height 0.48156 0.31595 1.524 0.133525
## weight -0.05046 0.04461 -1.131 0.263159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.19 on 52 degrees of freedom
## Multiple R-squared: 0.5548, Adjusted R-squared: 0.512
## F-statistic: 12.96 on 5 and 52 DF, p-value: 3.366e-08
MAE_WR6 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model6$fit)
MAE_WR6
## [1] 2.186923
Another metric that is included in my data set is air yards share, the percentage of the teams air yards a WR receives. Let’s see if including this term helps.
WR_model7 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + air_yards_share + age + height + weight, data = wrstats_2020)
summary(WR_model7)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + air_yards_share +
## age + height + weight, data = wrstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5535 -1.6997 -0.3455 1.2769 11.8268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.83272 17.92105 -0.214 0.83149
## fantasy_PPG_last_yr 1.08887 0.14810 7.352 1.36e-09 ***
## air_yards_share -21.91462 6.59459 -3.323 0.00164 **
## age -0.37338 0.19689 -1.896 0.06347 .
## height 0.31799 0.31125 1.022 0.31167
## weight -0.02687 0.04409 -0.609 0.54491
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.255 on 52 degrees of freedom
## Multiple R-squared: 0.5363, Adjusted R-squared: 0.4918
## F-statistic: 12.03 on 5 and 52 DF, p-value: 9.23e-08
MAE_WR7 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model7$fit)
MAE_WR7
## [1] 2.225001
The air yards term is statistically significant. Models 6 and 7 have almost identical performance, but I will use model 7, because I prefer the air yards share metric which is more widely used than air yards per game. The model is likely dealing with multicollinearity, considering that the coefficient on air yards share is negative. We would intuitively expect it to be positive. It would make sense for 2020 FPPG to be correlated with 2020 air yards share. One more variable I would like to try is target share, which was described at the beginning of this document.
WR_model8 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + air_yards_share + target_share + age + height + weight, data = wrstats_2020)
summary(WR_model8)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + air_yards_share +
## target_share + age + height + weight, data = wrstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9117 -1.4450 -0.3325 1.4170 10.8643
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.93790 17.36739 -0.572 0.569690
## fantasy_PPG_last_yr 0.77797 0.19329 4.025 0.000189 ***
## air_yards_share -29.42217 7.06988 -4.162 0.000122 ***
## target_share 36.14456 15.25428 2.369 0.021636 *
## age -0.44456 0.19107 -2.327 0.023994 *
## height 0.40619 0.30061 1.351 0.182596
## weight -0.02558 0.04225 -0.605 0.547630
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.12 on 51 degrees of freedom
## Multiple R-squared: 0.5823, Adjusted R-squared: 0.5332
## F-statistic: 11.85 on 6 and 51 DF, p-value: 2.797e-08
MAE_WR8 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model8$fit)
MAE_WR8
## [1] 2.130593
Target share ended up being a statistically significant variable. Let’s try dropping height and weight, which aren’t considered significant (P-value above 0.05).
WR_model9 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + air_yards_share + target_share + age, data = wrstats_2020)
summary(WR_model9)
##
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + air_yards_share +
## target_share + age, data = wrstats_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8350 -1.8512 -0.1926 1.6127 11.4830
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.0509 4.7880 3.143 0.002734 **
## fantasy_PPG_last_yr 0.7902 0.1908 4.141 0.000125 ***
## air_yards_share -26.7250 6.8397 -3.907 0.000266 ***
## target_share 31.9562 14.9615 2.136 0.037326 *
## age -0.4677 0.1895 -2.468 0.016836 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.127 on 53 degrees of freedom
## Multiple R-squared: 0.5639, Adjusted R-squared: 0.531
## F-statistic: 17.13 on 4 and 53 DF, p-value: 4.481e-09
MAE_WR9 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model9$fit)
MAE_WR9
## [1] 2.13011
There is little difference in the performance of WR models 6-9, but it is preferred to eliminate the insignificant terms if we can maintain or even improve the model.
We did not end up getting the same result as we did for the RB models where our best model was the one with the correlated terms dropped. In further analyses we could hopefully discover some more useful independent variables. This is at at least a starting point. Let’s take a look at the actual vs. predicted and constant variance plot for our best WR model, model 9.
best_WR_model <- data.frame(wrstats_2020$fantasy_PPG_21, predict(WR_model9))
names(best_WR_model) <- c("Actual", "Predicted")
ggplot(best_WR_model) +
geom_point(aes(x = Actual, y = Predicted)) +
geom_abline() +
labs(y = "Predicted Values", x = "Actual Values", title = "Actual Fantasy PPG vs. Predicted Fantasy PPG (WRs)") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
Not too bad, but it looks like we might have some pretty large residuals at the high end like we did with the RB position. Let’s take a look at the constant variance plot before we create a summary table.
ggplot(data=WR_model9, aes(x = .fitted, y = .resid))+
geom_point() +
ggtitle("Checking for Constant Variance") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
The constant variance assumption seems reasonably satisfied. Let’s take a look at the summary table as we did for the RB position.
wrstats_2020$predicted_fantasy_PPG_21 <- predict(WR_model9)
wrstats_2020$residual <- wrstats_2020$fantasy_PPG_21 - wrstats_2020$predicted_fantasy_PPG_21
wrstats_2020$PPG_21_rank <- rank(-wrstats_2020$fantasy_PPG_21)
WR_summary_table <- wrstats_2020 %>%
filter(PPG_21_rank <= 24) %>%
select(full_name, team_logo_espn, age, target_share, air_yards_share, fantasy_PPG_last_yr, fantasy_PPG_21, predicted_fantasy_PPG_21, residual, PPG_21_rank) %>%
arrange(-fantasy_PPG_21) %>%
gt() %>%
gtExtras::gt_img_rows(team_logo_espn) %>%
cols_label(full_name = "Player", team_logo_espn = "Team", age = "Age", target_share = "2020 Target Share", air_yards_share = "2020 Air Yards Share", fantasy_PPG_last_yr = "2020 FPPG", fantasy_PPG_21 = "2021 FPPG", predicted_fantasy_PPG_21 = "Predicted 2021 FPPG", residual = "Residual", PPG_21_rank = "2021 FPPG Ranking") %>%
fmt_number(c(age, fantasy_PPG_last_yr, fantasy_PPG_21, predicted_fantasy_PPG_21),
decimals = 2) %>%
fmt_percent(columns = c(air_yards_share, target_share),
decimals = 2) %>%
tab_header(title = "WR Model Summary", subtitle = "Table includes top 24 WRs (non-rookies) in 2021 FPPG (PPR) | Data: nflfastR")
#Applying conditional formatting to residual column
WR_summary_table %>%
data_color(
columns = c(residual),
colors = scales::col_numeric(
c("#f87274", "#FFFFFF", "#f87274"),
domain = c(-12, 12)
)
) # %>%
| WR Model Summary | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| Table includes top 24 WRs (non-rookies) in 2021 FPPG (PPR) | Data: nflfastR | |||||||||
| Player | Team | Age | 2020 Target Share | 2020 Air Yards Share | 2020 FPPG | 2021 FPPG | Predicted 2021 FPPG | Residual | 2021 FPPG Ranking |
| Cooper Kupp | 28.57 | 22.93% | 21.50% | 14.05 | 25.85 | 14.37 | 11.483001538 | 1 | |
| Davante Adams | 29.05 | 34.00% | 39.60% | 25.60 | 21.52 | 21.98 | -0.459486269 | 2 | |
| Deebo Samuel | 25.98 | 21.71% | 3.66% | 11.53 | 21.19 | 17.97 | 3.217789402 | 3 | |
| Justin Jefferson | 22.57 | 25.30% | 39.23% | 17.14 | 19.44 | 15.64 | 3.796808483 | 4 | |
| Tyreek Hill | 27.86 | 23.28% | 35.62% | 21.93 | 17.44 | 17.27 | 0.173765273 | 5 | |
| Chris Godwin | 25.87 | 18.83% | 19.86% | 15.92 | 17.31 | 16.24 | 1.072774517 | 6 | |
| Diontae Johnson | 25.52 | 22.19% | 24.87% | 14.92 | 17.15 | 15.35 | 1.798857570 | 7 | |
| Stefon Diggs | 28.11 | 29.91% | 34.89% | 20.54 | 16.79 | 18.36 | -1.570674942 | 8 | |
| Mike Evans | 28.39 | 17.88% | 23.16% | 15.54 | 16.41 | 13.58 | 2.830028671 | 9 | |
| Keenan Allen | 29.70 | 26.69% | 25.95% | 17.51 | 16.11 | 16.59 | -0.474374558 | 10 | |
| Tee Higgins | 22.98 | 21.77% | 28.17% | 12.97 | 15.65 | 13.98 | 1.666084485 | 11 | |
| Mike Williams | 27.27 | 16.57% | 30.20% | 10.98 | 15.41 | 8.20 | 7.214989840 | 12 | |
| Adam Thielen | 31.38 | 25.56% | 34.16% | 16.93 | 15.37 | 12.79 | 2.578747025 | 13 | |
| Robert Woods | 29.75 | 23.33% | 23.33% | 15.19 | 15.24 | 14.36 | 0.881631493 | 14 | |
| Hunter Renfrow | 26.05 | 14.33% | 12.46% | 8.22 | 15.24 | 10.62 | 4.625530133 | 15 | |
| Tyler Lockett | 29.28 | 24.30% | 28.12% | 16.59 | 15.09 | 14.71 | 0.375857925 | 16 | |
| DeAndre Hopkins | 29.60 | 29.51% | 34.10% | 17.99 | 14.72 | 15.74 | -1.021729388 | 17 | |
| CeeDee Lamb | 22.76 | 18.36% | 23.17% | 13.61 | 14.55 | 14.83 | -0.284467118 | 18 | |
| Brandin Cooks | 28.29 | 23.69% | 30.10% | 15.47 | 14.49 | 13.57 | 0.920606241 | 19 | |
| DK Metcalf | 24.07 | 24.62% | 40.66% | 16.96 | 14.37 | 14.19 | 0.178292241 | 20 | |
| Calvin Ridley | 27.05 | 25.10% | 41.23% | 18.77 | 14.22 | 14.23 | -0.007331846 | 21 | |
| Marquise Brown | 24.60 | 26.82% | 38.75% | 11.44 | 14.14 | 10.80 | 3.346783179 | 22 | |
| Michael Pittman | 24.26 | 14.28% | 15.11% | 7.61 | 14.04 | 10.24 | 3.796868846 | 23 | |
| A.J. Brown | 24.53 | 27.23% | 35.50% | 17.68 | 13.92 | 16.76 | -2.846501853 | 24 | |
# gtsave("WR_summary.png")
One limitation of my model is that it does not include rookies since I am using 2020 data to predict 2021. In the future, I would like to either create a separate model to account for rookies, or figure out a way to incorporate rookies into the model with the other players. Like we did for the RB position, I will now create a table of predictions for the upcoming season based on 2021 data.
#Getting 2021 variables used in model
wr_predictions_22 <- WR_names21 %>%
filter(nameteam %in% stats_2021$nameteam) %>%
left_join(stats_2021, by = "nameteam") %>%
filter(receptions > 20) %>%
arrange(-fantasy_points_ppr_21) %>%
mutate(fantasy_PPG_21 = fantasy_points_ppr_21 / games)
#New age/height/weight df with ages as of the end of the 2022 season
WR_age_ht_wt_22 <- fast_scraper_roster(2021) %>%
filter(position == "WR", full_name %in% wr_predictions_22$full_name) %>%
select(full_name, birth_date, height, weight) %>%
drop_na(birth_date) %>%
mutate(age = age_calc(birth_date, enddate = as.Date("2023-01-08"), units = "years", precise = TRUE),
height = as.numeric(height),
weight = as.numeric(weight))
#Join age/height/weight to predictions DF
wr_predictions_22 <- wr_predictions_22 %>%
left_join(WR_age_ht_wt_22, by = "full_name") %>%
drop_na(fantasy_PPG_21) %>%
rename(fantasy_PPG_last_yr = fantasy_PPG_21)
# wr_predictions_22$receiving_air_yards_PG_last_year <- wr_predictions_22$receiving_air_yards / wr_predictions_22$games
wr_predictions_22$predicted_22_FPPG <- predict(WR_model9, newdata = wr_predictions_22)
wr_predictions_22 <- wr_predictions_22 %>%
arrange(-predicted_22_FPPG)
wr_predictions_22 %>%
mutate(predicted_FPPG_rank = rank(-predicted_22_FPPG)) %>%
filter(predicted_FPPG_rank <= 24) %>%
select(full_name, team_logo_espn, age, target_share, air_yards_share, fantasy_PPG_last_yr, predicted_22_FPPG, predicted_FPPG_rank) %>%
gt() %>%
gtExtras::gt_img_rows(team_logo_espn) %>%
cols_label(full_name = "Player", team_logo_espn = "Team", age = "Age", target_share = "2021 Target Share", air_yards_share = "2021 Air Yards Share", fantasy_PPG_last_yr = "2021 FPPG", predicted_22_FPPG = "Predicted 2022 FPPG", predicted_FPPG_rank = "Predicted FPPG Ranking") %>%
fmt_number(c(age, fantasy_PPG_last_yr, predicted_22_FPPG),
decimals = 2) %>%
fmt_percent(c(air_yards_share, target_share),
decimals = 2) %>%
tab_header(title = "2022 WR Prediction Summary", subtitle = "Data: nflfastR")
| 2022 WR Prediction Summary | |||||||
|---|---|---|---|---|---|---|---|
| Data: nflfastR | |||||||
| Player | Team | Age | 2021 Target Share | 2021 Air Yards Share | 2021 FPPG | Predicted 2022 FPPG | Predicted FPPG Ranking |
| Cooper Kupp | 29.57 | 32.33% | 33.18% | 25.85 | 23.11 | 1 | |
| Deebo Samuel | 26.98 | 26.21% | 28.91% | 21.19 | 19.82 | 2 | |
| Davante Adams | 30.04 | 31.68% | 37.02% | 21.52 | 18.23 | 3 | |
| Chris Godwin | 26.86 | 21.11% | 19.55% | 17.31 | 17.69 | 4 | |
| Jaylen Waddle | 24.12 | 24.89% | 25.97% | 15.49 | 17.02 | 5 | |
| Amon-Ra St. Brown | 23.21 | 22.84% | 22.57% | 14.21 | 16.69 | 6 | |
| Hunter Renfrow | 27.05 | 21.05% | 17.15% | 15.24 | 16.59 | 7 | |
| Justin Jefferson | 23.56 | 29.41% | 45.82% | 19.44 | 16.54 | 8 | |
| Diontae Johnson | 26.51 | 27.90% | 33.60% | 17.15 | 16.14 | 9 | |
| Ja'Marr Chase | 22.85 | 23.23% | 36.90% | 17.92 | 16.08 | 10 | |
| Rondale Moore | 22.58 | 14.32% | 1.78% | 7.94 | 14.86 | 11 | |
| CeeDee Lamb | 23.75 | 20.08% | 26.17% | 14.55 | 14.86 | 12 | |
| Tee Higgins | 23.97 | 23.52% | 33.90% | 15.65 | 14.66 | 13 | |
| Marquise Brown | 25.60 | 26.30% | 31.19% | 14.14 | 14.32 | 14 | |
| Robert Woods | 30.75 | 21.77% | 21.25% | 15.24 | 13.99 | 15 | |
| Michael Pittman | 25.26 | 26.12% | 32.53% | 14.04 | 13.98 | 16 | |
| Stefon Diggs | 29.11 | 27.07% | 35.33% | 16.79 | 13.92 | 17 | |
| Tyreek Hill | 28.86 | 24.79% | 35.33% | 17.44 | 13.82 | 18 | |
| Keenan Allen | 30.70 | 25.69% | 30.35% | 16.11 | 13.52 | 19 | |
| DK Metcalf | 25.07 | 27.74% | 38.63% | 14.37 | 13.22 | 20 | |
| Kadarius Toney | 23.95 | 17.29% | 14.54% | 9.15 | 12.72 | 21 | |
| Darnell Mooney | 25.19 | 27.25% | 35.44% | 12.92 | 12.71 | 22 | |
| Mike Williams | 28.26 | 20.11% | 29.52% | 15.41 | 12.55 | 23 | |
| Elijah Moore | 22.78 | 19.53% | 30.24% | 12.56 | 12.48 | 24 | |
Similar to what was said about the RB position, I would not consider this model the golden ticket to perfectly drafting the WR position in fantasy football. It does better than the RB model, but I would still consider it more of a starting point. I would like to look into more independent variables in the future. I also plan to come back to this model after the 2022 NFL season to see how it did.
Overall, this project was a great learning experience. In my free time, I am interested in getting further into Sports Analytics. This project was a great starting point for me in that area. I got more comfortable with the nflfastR package which is very convenient to conduct quick NFL analysis, I improved my skills in cleaning/preparing data with dplyr, plotting in ggplot, and I became familiar with creating regression models on sports data. I also enjoyed learning the gt package to create easy to read summary tables. These are convenient for sharing with friends or social media. I consider the models that I created as starting points. I wouldn’t necessarily use them as my key decision making factor in a fantasy draft, but they did lead to interesting conclusions and a good learning experience.